I am using formula 1 dataset again to plot the most successful drivers
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import numpy as np
df_driver = pd.read_csv("GrandPrix_drivers_details_1950_to_2022.csv", encoding='ISO-8859-1')
df_driver
| Pos | Driver | Nationality | Car | PTS | year | |
|---|---|---|---|---|---|---|
| 0 | 1 | Nino Farina FAR | ITA | Alfa Romeo | 30.0 | 1950 |
| 1 | 2 | Juan Manuel Fangio FAN | ARG | Alfa Romeo | 27.0 | 1950 |
| 2 | 3 | Luigi Fagioli FAG | ITA | Alfa Romeo | 24.0 | 1950 |
| 3 | 4 | Louis Rosier ROS | FRA | Talbot-Lago | 13.0 | 1950 |
| 4 | 5 | Alberto Ascari ASC | ITA | Ferrari | 11.0 | 1950 |
| ... | ... | ... | ... | ... | ... | ... |
| 1613 | 18 | Zhou Guanyu ZHO | CHN | Alfa Romeo Ferrari | 6.0 | 2022 |
| 1614 | 19 | Alexander Albon ALB | THA | Williams Mercedes | 4.0 | 2022 |
| 1615 | 20 | Nicholas Latifi LAT | CAN | Williams Mercedes | 2.0 | 2022 |
| 1616 | 21 | Nyck De Vries DEV | NED | Williams Mercedes | 2.0 | 2022 |
| 1617 | 22 | Nico Hulkenberg HUL | GER | Aston Martin Aramco Mercedes | 0.0 | 2022 |
1618 rows × 6 columns
#calculating drivers active years
active_years = df_driver.groupby('Driver')['year'].agg(['min', 'max']).reset_index()
active_years['ActiveYears'] = active_years.apply(lambda row: f"{row['min']}-{row['max']}", axis=1)
active_years
| Driver | min | max | ActiveYears | |
|---|---|---|---|---|
| 0 | Adrian Sutil SUT | 2007 | 2014 | 2007-2014 |
| 1 | Aguri Suzuki SUZ | 1990 | 1995 | 1990-1995 |
| 2 | Alain Prost PRO | 1980 | 1993 | 1980-1993 |
| 3 | Alan Jones JON | 1975 | 1986 | 1975-1986 |
| 4 | Alan Brown BRO | 1952 | 1952 | 1952-1952 |
| ... | ... | ... | ... | ... |
| 391 | Yuji Ide IDE | 2006 | 2006 | 2006-2006 |
| 392 | Yuki Tsunoda TSU | 2021 | 2022 | 2021-2022 |
| 393 | Yves Giraud Cabantous CAB | 1950 | 1951 | 1950-1951 |
| 394 | Zhou Guanyu ZHO | 2022 | 2022 | 2022-2022 |
| 395 | Zsolt Baumgartner BAU | 2003 | 2004 | 2003-2004 |
396 rows × 4 columns
df_race = pd.read_csv("GrandPrix_races_details_1950_to_2022.csv", encoding='ISO-8859-1')
df_race.head()
| Grand Prix | Date | Winner | Car | Laps | Time | year | |
|---|---|---|---|---|---|---|---|
| 0 | Great Britain | 13 May 1950 | Nino Farina FAR | Alfa Romeo | 70.0 | 2:13:23.600 | 1950.0 |
| 1 | Monaco | 21 May 1950 | Juan Manuel Fangio FAN | Alfa Romeo | 100.0 | 3:13:18.700 | 1950.0 |
| 2 | Indianapolis 500 | 30 May 1950 | Johnnie Parsons PAR | Kurtis Kraft Offenhauser | 138.0 | 2:46:55.970 | 1950.0 |
| 3 | Switzerland | 04 Jun 1950 | Nino Farina FAR | Alfa Romeo | 42.0 | 2:02:53.700 | 1950.0 |
| 4 | Belgium | 18 Jun 1950 | Juan Manuel Fangio FAN | Alfa Romeo | 35.0 | 2:47:26.000 | 1950.0 |
df_merged = pd.merge(df_race[['Winner', 'Car', 'Date']], df_driver[['Driver', 'Car', 'PTS', 'year']],
left_on=['Winner', 'Car'], right_on=['Driver', 'Car'])
df_merged = df_merged[['Winner', 'Car', 'Date', 'PTS']]
print(df_merged.head())
Winner Car Date PTS 0 Nino Farina FAR Alfa Romeo 13 May 1950 30.0 1 Nino Farina FAR Alfa Romeo 13 May 1950 19.0 2 Nino Farina FAR Alfa Romeo 04 Jun 1950 30.0 3 Nino Farina FAR Alfa Romeo 04 Jun 1950 19.0 4 Nino Farina FAR Alfa Romeo 03 Sep 1950 30.0
df_merged['Year'] = pd.to_datetime(df_merged['Date'], format='mixed',dayfirst=True).dt.year
df_merged.head()
# Calculate the total PTS for a driver in each year
#df_total_pts = df_merged.groupby(['Winner', 'Year'])['PTS'].sum().reset_index()
#df_total_pts.head()
| Winner | Car | Date | PTS | Year | |
|---|---|---|---|---|---|
| 0 | Nino Farina FAR | Alfa Romeo | 13 May 1950 | 30.0 | 1950 |
| 1 | Nino Farina FAR | Alfa Romeo | 13 May 1950 | 19.0 | 1950 |
| 2 | Nino Farina FAR | Alfa Romeo | 04 Jun 1950 | 30.0 | 1950 |
| 3 | Nino Farina FAR | Alfa Romeo | 04 Jun 1950 | 19.0 | 1950 |
| 4 | Nino Farina FAR | Alfa Romeo | 03 Sep 1950 | 30.0 | 1950 |
#sum up the column of points for each driver grouping by years
df_merged['PTS'] = df_merged.groupby(['Winner', 'Year'])['PTS'].transform('sum')
# Count the occurrence of each winner in each year
df_merged['Wins'] = df_merged.groupby(['Winner', 'Year'])['Winner'].transform('count')
# Select the final columns
df_final = df_merged[['Winner', 'Year', 'PTS', 'Wins', 'Car']].drop_duplicates()
df_final.head()
| Winner | Year | PTS | Wins | Car | |
|---|---|---|---|---|---|
| 0 | Nino Farina FAR | 1950 | 147.0 | 6 | Alfa Romeo |
| 6 | Nino Farina FAR | 1951 | 49.0 | 2 | Alfa Romeo |
| 8 | Juan Manuel Fangio FAN | 1950 | 174.0 | 6 | Alfa Romeo |
| 14 | Juan Manuel Fangio FAN | 1951 | 174.0 | 6 | Alfa Romeo |
| 20 | Johnnie Parsons PAR | 1950 | 9.0 | 1 | Kurtis Kraft Offenhauser |
df_final.sort_values('Year', inplace=True)
# Convert 'Wins' column to cumulative sum
df_final['Wins'] = df_final.groupby('Winner')['Wins'].cumsum()
df_final['PTS'] = df_final.groupby('Winner')['PTS'].cumsum()
# Select the final columns
df_final = df_final[['Winner', 'Year', 'PTS', 'Wins', 'Car']].drop_duplicates()
df_final.head()
| Winner | Year | PTS | Wins | Car | |
|---|---|---|---|---|---|
| 0 | Nino Farina FAR | 1950 | 147.0 | 6 | Alfa Romeo |
| 8 | Juan Manuel Fangio FAN | 1950 | 174.0 | 6 | Alfa Romeo |
| 20 | Johnnie Parsons PAR | 1950 | 9.0 | 1 | Kurtis Kraft Offenhauser |
| 6 | Nino Farina FAR | 1951 | 196.0 | 8 | Alfa Romeo |
| 14 | Juan Manuel Fangio FAN | 1951 | 348.0 | 12 | Alfa Romeo |
#check for min and macx values to set the x and y range in plot
min_pts = df_final['Wins'].min()
min_pts
1
max_wins = df_final['Wins'].max()
max_wins
946
I have to use some kind of normalization because for the values of column PTS,there is a wide range of values from 6 to 4225925. This causes the scatter plot bubbles to appear very tiny. I tried using log as well as it can help in visually compressing the data, but that didnt help either.
from sklearn.preprocessing import MinMaxScaler
# Perform Min-Max normalization on the 'Wins' column
scaler = MinMaxScaler(feature_range=(0, 1))
df_final['PTS_normalized'] = scaler.fit_transform(df_final['PTS'].values.reshape(-1, 1))
df_final['Wins_normalized'] = scaler.fit_transform(df_final['Wins'].values.reshape(-1, 1))
maxwins=df_final['Wins'].max()
maxwins
946
# Create the scatter plot with the normalized 'Wins' column
fig = px.scatter(df_final, x='PTS_normalized', y='Wins_normalized', animation_frame='Year', animation_group='Winner',
size='PTS_normalized', color='Winner', hover_name='Car', range_y=[0, 1], range_x=[0, 1])
fig.update_layout(xaxis_title="PTS", yaxis_title="Wins", legend_title="Winner")
# Set figure layout
#fig.update_layout(title='Drivers with Most Wins and PTS Over the Years',
# xaxis_title='Total points', yaxis_title='Year')
# Show the figure
fig.show()
You can see very tiny scatter plots from 1950 to 1990, then we clearly some bubbles, two of them very big, these drivers are Michael Schumacher and Lewis Hamilton, they almost standout very loudly from the rest of the drivers.
I will now focus on the data after year 1990. And also zoom in to view the left bottom corner of the previous visual.
filtered_data = df_final[df_final['Year'] >= 1990]
filtered_data.head()
| Winner | Year | PTS | Wins | Car | PTS_normalized | Wins_normalized | |
|---|---|---|---|---|---|---|---|
| 1712 | Thierry Boutsen BOU | 1990 | 213.0 | 6 | Williams Renault | 0.000677 | 0.005291 |
| 1743 | Nelson Piquet PIQ | 1990 | 2678.0 | 64 | Benetton Ford | 0.008745 | 0.066667 |
| 1727 | Riccardo Patrese PAT | 1990 | 208.0 | 7 | Williams Renault | 0.000661 | 0.006349 |
| 1717 | Alain Prost PRO | 1990 | 8511.5 | 135 | Ferrari | 0.027837 | 0.141799 |
| 1706 | Nigel Mansell MAN | 1990 | 2331.0 | 45 | Ferrari | 0.007609 | 0.046561 |
#also noticed that legend doesnt display all driver, probably cause the high number of drivers. Lets take only the top 15 drivers for the next visual
unique_drivers = filtered_data['Winner'].nunique()
unique_drivers
#figured out that the legend was missing drivers because of the same color being assigned to more than one drivers
43
top_drivers = filtered_data['Winner'].value_counts().head(20).index.tolist()
df_top_15 = filtered_data[filtered_data['Winner'].isin(top_drivers)]
df_top_15
| Winner | Year | PTS | Wins | Car | PTS_normalized | Wins_normalized | |
|---|---|---|---|---|---|---|---|
| 1706 | Nigel Mansell MAN | 1990 | 2331.0 | 45 | Ferrari | 0.007609 | 0.046561 |
| 1622 | Ayrton Senna SEN | 1990 | 7966.0 | 110 | McLaren Honda | 0.026051 | 0.115344 |
| 1794 | Gerhard Berger BER | 1991 | 880.0 | 28 | McLaren Honda | 0.002860 | 0.028571 |
| 1652 | Ayrton Senna SEN | 1991 | 10584.0 | 145 | McLaren Honda | 0.034619 | 0.152381 |
| 1749 | Nigel Mansell MAN | 1991 | 3296.0 | 60 | Williams Renault | 0.010767 | 0.062434 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4945 | Sergio Perez PER | 2021 | 367.0 | 3 | Red Bull Racing Honda | 0.001181 | 0.002116 |
| 4948 | Daniel Ricciardo RIC | 2021 | 3646.0 | 20 | McLaren Mercedes | 0.011913 | 0.020106 |
| 4654 | Lewis Hamilton HAM | 2021 | 305558.0 | 946 | Mercedes | 1.000000 | 1.000000 |
| 4950 | Max Verstappen VER | 2022 | 21790.5 | 68 | Red Bull Racing RBPT | 0.071296 | 0.070899 |
| 4965 | Sergio Perez PER | 2022 | 977.0 | 5 | Red Bull Racing RBPT | 0.003178 | 0.004233 |
122 rows × 7 columns
# Set the color_discrete_sequence with a list of colors
color_sequence = px.colors.qualitative.Set1[:30] # Adjust the number of colors as needed
fig = px.scatter(filtered_data, x='PTS_normalized', y='Wins_normalized', animation_frame='Year', animation_group='Winner',
size='PTS_normalized', color='Winner', hover_name='Car', range_y=[0, 1], range_x=[0, 1],
labels={'PTS_normalized': 'PTS', 'Wins_normalized': 'Wins'},
title='Top 15 Drivers: PTS vs Wins (1990-2020)',
color_discrete_sequence=color_sequence)
fig.update_layout(xaxis_title="PTS", yaxis_title="Wins", legend_title="Winner")
fig.show()
Now, the data is visible as compared to the first one. we can pause and see the bubbles for details by hovering over it. I am still facing issues while assigning colors to the drivers.
#Most successful drivers since 1950 -2020
df_winners = df_race.groupby(['Winner', 'Car']).size().reset_index(name='Wins')
df_winners = df_winners.sort_values(by=['Wins'], ascending=False)
df_winners = df_winners[df_winners['Wins'] >= 7]
fig = px.scatter(df_winners, x="Wins", y="Winner", color="Car", size="Wins", hover_name="Car",
hover_data={"Wins": True},
title="Most Successful Drivers and Teams in Terms of Race Wins",
labels={"Wins": "Race Wins", "Winner": "Driver"})
fig.update_layout(xaxis_title="Race Wins", yaxis_title="Driver", legend_title="Car")
fig.show()
fig.write_html('top_drivers_animation.html', auto_open=False) # Save as HTML file
constructors= pd.read_csv('constructors.csv')
constructors.head()
constructors= constructors[['name','nationality']]
count= constructors.groupby( [ "nationality","name"] ).size().reset_index()
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
constructors_count = constructors['nationality'].value_counts()
data = [go.Bar(
x = constructors_count.index,
y = constructors_count.values,
# marker = dict(color = random_colors(25))
)]
layout = dict(
title= "Constructors by Country "
)
fig = dict( data=data, layout=layout )
iplot( fig, validate=False )
Most race constructor teams have been british, followed by American, Italian and then the number really drops from French, German and the others following behind.
constructors_count = constructors['nationality'].value_counts()
trace = go.Pie(labels=constructors_count.index, values=constructors_count.values, hole=0.6,textinfo= "none")
layout = go.Layout(
title='Percentage of Constructors by Nationality'
)
fig = go.Figure(data=[trace], layout=layout)
iplot(fig, filename="plotting-library")
Plotting the very recent British GrandProx results which happened this sunday on 9th July 2023. Visualing the fastest cars during each laps, focusing on the top 5 drivers.
laptimes = pd.read_csv('lap_times.csv')
laptimes.head()
| raceId | driverId | lap | position | time | milliseconds | |
|---|---|---|---|---|---|---|
| 0 | 841 | 20 | 1 | 1 | 1:38.109 | 98109 |
| 1 | 841 | 20 | 2 | 1 | 1:33.006 | 93006 |
| 2 | 841 | 20 | 3 | 1 | 1:32.713 | 92713 |
| 3 | 841 | 20 | 4 | 1 | 1:32.803 | 92803 |
| 4 | 841 | 20 | 5 | 1 | 1:32.342 | 92342 |
drivers = pd.read_csv('drivers.csv')
drivers.head()
| driverId | driverRef | number | code | forename | surname | dob | nationality | url | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | hamilton | 44 | HAM | Lewis | Hamilton | 1985-01-07 | British | http://en.wikipedia.org/wiki/Lewis_Hamilton |
| 1 | 2 | heidfeld | \N | HEI | Nick | Heidfeld | 1977-05-10 | German | http://en.wikipedia.org/wiki/Nick_Heidfeld |
| 2 | 3 | rosberg | 6 | ROS | Nico | Rosberg | 1985-06-27 | German | http://en.wikipedia.org/wiki/Nico_Rosberg |
| 3 | 4 | alonso | 14 | ALO | Fernando | Alonso | 1981-07-29 | Spanish | http://en.wikipedia.org/wiki/Fernando_Alonso |
| 4 | 5 | kovalainen | \N | KOV | Heikki | Kovalainen | 1981-10-19 | Finnish | http://en.wikipedia.org/wiki/Heikki_Kovalainen |
results = pd.read_csv('results.csv')
results.head()
| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616 | 39 | 2 | 1:27.452 | 218.300 | 1 |
| 1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094 | 41 | 3 | 1:27.739 | 217.586 | 1 |
| 2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | 58 | +8.163 | 5698779 | 41 | 5 | 1:28.090 | 216.719 | 1 |
| 3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | 58 | +17.181 | 5707797 | 58 | 7 | 1:28.603 | 215.464 | 1 |
| 4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | 58 | +18.014 | 5708630 | 43 | 1 | 1:27.418 | 218.385 | 1 |
# Merge the laptimes and drivers datasets based on driverId
merged_data = pd.merge(laptimes, drivers, on='driverId')
merged_data['time_seconds'] = merged_data['milliseconds'] / 1000
merged_data=merged_data[merged_data['raceId'] == 1108]
merged_filtered_data = merged_data[(merged_data['time_seconds'] >= 92) & (merged_data['time_seconds'] <= 95)]
# Group the data by driverId and create a line plot for each driver
plt.figure(figsize=(12, 6))
grouped_data = merged_filtered_data.groupby('driverId')
for driverId, group in grouped_data:
driver_surname = group['surname'].iloc[0]
plt.plot(group['lap'],group['time_seconds'], label=driver_surname)
# Set labels and title
plt.ylabel('Time (seconds)')
plt.xlabel('Lap')
plt.title('Lap Times (92-95 seconds)')
plt.legend(title='Driver',loc='best')
# Show the plot
plt.show()
A lot of lines,as the plot is trying to visualise laptime for 20 drivers currently, lets focus on the fastest 5 drivers from the set.
# Filter the data for raceId 1108
race_1108 = results[results['raceId'] == 1108]
# Sort the data by positionOrder in ascending order
sorted_data = race_1108.sort_values('positionOrder')
# Get the top 5 drivers
top_5_drivers = sorted_data.head(5)
# Merge the datasets based on the common columns
merged_data = pd.merge(merged_data,top_5_drivers, on=['raceId', 'driverId'])
merged_data
| raceId | driverId | lap | position_x | time_x | milliseconds_x | driverRef | number_x | code | forename | ... | positionOrder | points | laps | time_y | milliseconds_y | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1108 | 1 | 1 | 8 | 1:40.664 | 100664 | hamilton | 44 | HAM | Lewis | ... | 3 | 15.0 | 52 | +6.783 | 5123721 | 43 | 3 | 1:30.545 | 234.221 | 1 |
| 1 | 1108 | 1 | 2 | 8 | 1:34.361 | 94361 | hamilton | 44 | HAM | Lewis | ... | 3 | 15.0 | 52 | +6.783 | 5123721 | 43 | 3 | 1:30.545 | 234.221 | 1 |
| 2 | 1108 | 1 | 3 | 8 | 1:33.662 | 93662 | hamilton | 44 | HAM | Lewis | ... | 3 | 15.0 | 52 | +6.783 | 5123721 | 43 | 3 | 1:30.545 | 234.221 | 1 |
| 3 | 1108 | 1 | 4 | 8 | 1:33.086 | 93086 | hamilton | 44 | HAM | Lewis | ... | 3 | 15.0 | 52 | +6.783 | 5123721 | 43 | 3 | 1:30.545 | 234.221 | 1 |
| 4 | 1108 | 1 | 5 | 8 | 1:33.718 | 93718 | hamilton | 44 | HAM | Lewis | ... | 3 | 15.0 | 52 | +6.783 | 5123721 | 43 | 3 | 1:30.545 | 234.221 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 255 | 1108 | 857 | 48 | 4 | 1:31.159 | 91159 | piastri | 81 | PIA | Oscar | ... | 4 | 12.0 | 52 | +7.776 | 5124714 | 41 | 4 | 1:30.850 | 233.435 | 1 |
| 256 | 1108 | 857 | 49 | 4 | 1:31.422 | 91422 | piastri | 81 | PIA | Oscar | ... | 4 | 12.0 | 52 | +7.776 | 5124714 | 41 | 4 | 1:30.850 | 233.435 | 1 |
| 257 | 1108 | 857 | 50 | 4 | 1:31.752 | 91752 | piastri | 81 | PIA | Oscar | ... | 4 | 12.0 | 52 | +7.776 | 5124714 | 41 | 4 | 1:30.850 | 233.435 | 1 |
| 258 | 1108 | 857 | 51 | 4 | 1:31.305 | 91305 | piastri | 81 | PIA | Oscar | ... | 4 | 12.0 | 52 | +7.776 | 5124714 | 41 | 4 | 1:30.850 | 233.435 | 1 |
| 259 | 1108 | 857 | 52 | 4 | 1:30.941 | 90941 | piastri | 81 | PIA | Oscar | ... | 4 | 12.0 | 52 | +7.776 | 5124714 | 41 | 4 | 1:30.850 | 233.435 | 1 |
260 rows × 31 columns
merged_data['time_seconds'] = merged_data['milliseconds_x'] / 1000
merged_filtered_data = merged_data[(merged_data['time_seconds'] >= 92) & (merged_data['time_seconds'] <= 95)]
# Group the data by driverId and create a line plot for each driver
grouped_data = merged_filtered_data.groupby('driverId')
grouped_data
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000258B6A82F40>
plt.style.use('dark_background')
plt.figure(figsize=(12, 6))
for driverId, group in grouped_data:
driver_surname = group['surname'].iloc[0] # Get the surname of the driver
plt.plot(group['lap'],group['time_seconds'], label=driver_surname)
# Set labels and title
plt.ylabel('Time (seconds)')
plt.xlabel('Lap')
plt.title('British GP Lap Times (92-95 seconds)')
plt.legend(title='Driver')
# Show the plot
plt.show()
So we set the x axis with a time range of 92-95 seconds as it is the average time taken by the drivers for covering each lap. We see that Verstappen is very low on the graph, meaning he takes less time than others to cover the laps implying that he is the fastest among them all. Also, notice that Norris was faster than everybody else in the beginning of the race, giving Verstappen a good challenge for most part of the race until the 20th lap, then his car becomes slower and Hamilton gains speed from 26th lap after being the slowest among these 5 drivers for starting part of the race.